Contamination detection and microbiome exploration with GRIMER

Abstract Background Contamination detection is an important step that should be carefully considered in early stages when designing and performing microbiome studies to avoid biased outcomes. Detecting and removing true contaminants is challenging, especially in low-biomass samples or in studies lacking proper controls. Interactive visualizations and analysis platforms are crucial to better guide this step, to help to identify and detect noisy patterns that could potentially be contamination. Additionally, external evidence, like aggregation of several contamination detection methods and the use of common contaminants reported in the literature, could help to discover and mitigate contamination. Results We propose GRIMER, a tool that performs automated analyses and generates a portable and interactive dashboard integrating annotation, taxonomy, and metadata. It unifies several sources of evidence to help detect contamination. GRIMER is independent of quantification methods and directly analyzes contingency tables to create an interactive and offline report. Reports can be created in seconds and are accessible for nonspecialists, providing an intuitive set of charts to explore data distribution among observations and samples and its connections with external sources. Further, we compiled and used an extensive list of possible external contaminant taxa and common contaminants with 210 genera and 627 species reported in 22 published articles. Conclusion GRIMER enables visual data exploration and analysis, supporting contamination detection in microbiome studies. The tool and data presented are open source and available at https://gitlab.com/dacs-hpi/grimer.

The main point of confusion I'm concerned about is regarding the "common contaminants". It's not convincing that you can just classify a taxon as a contaminant regardless of what environment is being profiled. Also, under this approach, if a taxon is identified once as a contaminant in an earlier study, would it then be classified as a contaminant in all datasets processed by GRIMER? This would mean that a lot of high-abundance taxa in certain environments would be wrongly thrown out. For instance, you can imagine high-abundance taxa on the human skin might be more likely to be contaminants during sequencing preparation, but of course many researchers are very interested in profiling the skin microbiome. I think the authors realize this, but I'm concerned that typical users may not appreciate this point. I think explicit discussion of this point in the discussion is needed, and an example of how this might look in practice (e.g., if skin microbiome samples were input to GRIMER, as part of a larger tutorial that could be online [see next point]) would help avoid this mistake.
---Answer ---Thank you for the valuable suggestion. In our own experience, contamination can be very difficult to detect, and looking at many aspects can lead to better decisions. The use of the common contaminants should be just one line of evidence supporting contamination. We clarified the intended use of the common contaminant list in the main article: "The idea behind compiling this list is to detect which taxa are most recurrently identified as contaminants in diverse conditions, providing a guideline and consensus for further studies. Entries on this list are not strictly considered contaminants and should not be used alone to define contamination in a study. However, the list serves as additional evidence supporting it, especially if entries are highly recurrent (Table 4) and are corroborated by additional lines of evidence." We also expanded and clarified the use of such annotations in the discussion: "In addition to the GRIMER software, we compiled and provided in this work a list of common taxa contaminants based on 22 publications (Table 3). Many of the reported contaminants are recurrent in diverse studies, pointing to a consensus for some taxa (Table 4) as probable contaminants. Taxa in this list cannot be strictly considered contaminants by themselves. However, they can corroborate suspicious contamination discovered via several other lines of evidence without the extra effort of researching the literature. The presented list is not comprehensive but a first step to centralize and standardize re-occurring contaminants described in the literature. We expect this list to grow incrementally over time as more evidence of kit and laboratory contamination becomes available. The information on common contaminants is a valuable resource to aid contamination detection and we are willing to maintain and extend it. Improvements to the list and suggestions of further candidate taxa can be provided via the GRIMER repository at https://github.com/pirovc/grimer/.
As future work, the list can be associated with study details such as biome, extraction kit and methodology, to be further queried and integrated in more detail.
In addition to the aforementioned common contaminants, GRIMER can also use general lists of custom organisms to annotate samples. In this work and by default, human-related organisms commonly occurring in human skin, oral and nasal cavities as well as face and other human limbs are used since they can be external sources of contamination. Those lists can be easily provided as taxonomic identifiers or names to GRIMER. If the target study conflicts with any of those environments (e.g. study of human skin), one could simply remove the related entries from the configuration files. More details and examples on how to perform this can be found in the online documentation." We also included in the newly created documentation instructions on how to deal with overlapping studies based on our contamination lists: "Common contaminants compiled from the literature and human-related possible sources of contamination are available in the GRIMER repository. For more information, please refer to the pre-print. If the target study overlaps with some of those annotations (e.g. study of human skin), related entries can be easily removed from the provided files to not generate redundant annotations."
--------------The authors do a great job of walking through some results in the text, but more documentation is needed for the reports. The authors should include a basic tutorial that provides example input files and then walks through each individual tab. This could be done all through text with screenshots of GRIMER, or perhaps with a video tutorial. In addition, for someone just opening the example reports, I'm sure they will be wondering what data was produced by GRIMER (e.g., they might wrongly think GRIMER did the taxonomic classification) and what data was needed as input.
---Answer ---An online documentation page was created for running GRIMER, as well as a user manual for GRIMER reports with a walk-through of each tab and plot. They can be found at: https://pirovc.github.io/grimer/
--------------The authors should expand on how the correlation step is used to identify contaminants. There is great interest in identifying clusters of co-occurring taxa, so identifying a cluster of 9 genera in Figure 5 doesn't seem like evidence of contamination to me. Perhaps it is when considered with other lines of evidence though, but this should be made clearer. Currently this legend implies that it alone points to reagent-derived contamination.
---Answer ---Indeed, the correlation alone cannot define contamination and the information was misleading. This is now fixed and clarified in the main text: "Symmetric proportionality coefficient (rho correlation) between top 50 most abundant genera in the KatharoSeq data. Positive correlation values (between 0 and 1) are displayed in red. Negative correlation values (between -1 and 0) are displayed in blue. Highly correlated matrix among 9 genera (dark red) points to reagent-derived contamination, when considered with other lines of evidence (Figure 6)"
--------------The figure text needs to be increased in size. Using more panels split across additional rows and removing unnecessary info (e.g., not all control categories need to be shown in Figure 1) would make these figures easier to interpret. I realize that you were hoping to use the raw GRIMER figures, but based on the current display items it does not seem like they are publication ready.
---Answer ---All figures were re-generated, with increased text size. Some of them were split in further panels for better visualization.
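For readers unfamiliar with the rho coefficient named in the revised Figure 5 legend, the following is a minimal sketch of the usual definition of the symmetric proportionality coefficient on centered log-ratio (CLR) transformed data: rho(x, y) = 1 - var(x - y) / (var(x) + var(y)). The toy counts and function names are hypothetical, and the sketch assumes no zero counts; how GRIMER handles zeros and which implementation it uses is not stated in this letter.

```python
import math
from statistics import variance

# Hypothetical toy counts (rows: samples, columns: genera A, B, C).
# B is always twice A, so A and B are perfectly proportional.
counts = [
    [10, 20, 5],
    [20, 40, 50],
    [40, 80, 7],
    [80, 160, 300],
]

def clr(sample):
    """Centered log-ratio transform of one sample (assumes no zero counts)."""
    logs = [math.log(value) for value in sample]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def rho(x, y):
    """Symmetric proportionality coefficient of two CLR-transformed features."""
    diff = [a - b for a, b in zip(x, y)]
    return 1.0 - variance(diff) / (variance(x) + variance(y))

transformed = [clr(sample) for sample in counts]
# Transpose so that each list holds one genus across all samples.
genus_a, genus_b, genus_c = (list(column) for column in zip(*transformed))

rho_ab = rho(genus_a, genus_b)  # proportional genera: value close to 1
rho_ac = rho(genus_a, genus_c)  # unrelated genera: lower value
```

Values lie between -1 and 1, matching the colour scale described in the legend. A block of high values only says that the genera vary together, which is why the legend now asks for further lines of evidence.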
The acronym WGS generally refers to "whole genome sequencing" (i.e., for single isolate organisms) not "whole metagenome sequencing". The standard acronym for the latter case would be "MGS", for "metagenomics". Also, the term "shotgun metagenomics sequencing" is most commonly used in this context; I've never come across "whole metagenome sequencing" before. Either way, "WGS" will mislead casual readers with the current usage, so this should be changed on your website and in the manuscript.
---Answer ---The term was changed in the manuscript to MGS, referring to metagenomics as well as in the pictures and reports.
The taxa parsing capabilities sound like they will save a lot of tedious, manual data mapping! Just checking: how does it perform with new taxa names / typos?
---Answer ---The taxonomy parsing and conversion are enabled with MultiTax (https://github.com/pirovc/multitax). New names are not a problem, since GRIMER will always download the latest taxonomy available when running and automatically find and update them. Typos are not covered: such entries will still be included in the report, just without any taxonomic connections (annotations, MGnify).
--------------Text edits
L11 - "are challenging task" should be "is challenging" Done
L12 - can remove "by design" Done
L12 - "helping to" should be "to help" Done
L13 - "can potentially be a source" I think should be "that could reflect" Changed to: "that could potentially be contamination"
L14 - "evidences" should be "evidence" Done
L13 + L14 - Unclear what is meant by "external evidences, aggregation of methods and data and common contaminant" - should be clarified Changed to: "Additionally, external evidence, like aggregation of several contamination detection methods and the use of common contaminants reported in the literature could help to discover and mitigate contamination."
L15 - "that perform" should be "that performs" Done
L17 - "towards contamination detection" should be something like "to help detect contamination" Done
L41 - "hypothesis" should be "hypotheses" Done
L42/43 - "analysis can hardly be fully" should be something like "the required analysis is difficult to fully…" Changed to: "After measurements are obtained, hypotheses are validated through data mining and statistical analysis. This step is mostly exploratory and specific to the hypotheses and research questions pursued and the required analysis is difficult to be fully automatized."
L56 - "technicians body" should be "a technician's body" Done
L60 - "strongly affects environmental" should be "especially environmental," (note comma) Done
L64 - "ideal scenario for an" should be "an ideal scenario for" Done
L67 - "not to bias measurements and not to" should be reworded, possibly as: "to not bias measurements and to ensure that bias is not propagated into databases" Done
L75 - "were proposed. They are" should be "have been proposed. These are" Done
L77 - "among others" should be ", and others" (note comma) Done
L79 - "increase in costs" should be "the required increase in costs" Done
L88 - add "a" before focus Done
L90, L196, L265, and elsewhere - "evidences" should be "evidence" Done
L99, L104, L117, and possibly elsewhere - "analysis" should be "analyses" (when plural) Done
L106 - "each samples/compositions" should be "each sample/composition" Done
L110 - add "a" before taxonomy database and "the" before "DNA concentration" Done
L132 - "specially" should be "especially" Done
L134 - remove "a" before "the" Done
L151 - add "of" after "thousands" Done
L182 - "is" should be "are" Done
L196 - "evidences" should be "evidence". And rather than "Evidences towards" it would be correct to say "Evidence for" or "Evidence supporting" Done
L208 - add "the" before "overall" Done
L246/247 - "generated several studies and investigations" should be something like "motivated several investigations" Changed to: "The attempt to detect and describe a possible human placental microbiome has motivated several studies and investigations"
L248 - should be something like "from the maternal and fetal sides" Done
L279 - remove "a" Done (L278)
L280 - Add "the" before "Jet" Done
L284 - capitalize "Qiita" and re-word "Pick closed-reference OTUs with 97% annotated with greengenes taxonomy" Done and changed to: "We downloaded the OTU table and metadata from KatharoSeq evaluations for the 16S rRNA analyses available in Qiita in the following configuration: reads trimmed at 150bp and classified using closed-reference OTUs clustered at 97% similarity annotated with the greengenes taxonomy."
L293 - Should be "Furthermore" rather than "Further" Done
L295 - I think it should be "with low and high human exposure, respectively"? Or do you mean they both have highly variable exposure? Done
L297 - "could be a also an" should be "could be driven by an" Done
L300 - "against" should be "and" Done
L304 - "correlated genus" should be "correlated genera" (and in other cases, such as in the Fig 5 and 6 legends, where "genus" should be plural version, i.e., "genera") Done in all occurrences
L305 - "Such pattern" should be "Such a pattern" Done
L307 - Should be "groups" rather than "organisms groups", or just "genera" as I believe each is a genus Done
L313 - Remove "a" Done
"taxa is abundant" should be "This taxon is abundant" and "inversely correlate" should be "inversely correlated". "a contamination evidence" should be "potential contamination" Done

Reviewer #2: Piro and Renard introduce GRIMER, a tool that automates microbiome-related analyses and creates a rich, offline-supported report that can be shared with collaborators or hosted online. I think that they gave a great summary of the problem of contamination in the microbiome field, and clearly explain the gap that their software fills. They exhibit GRIMER on previously published datasets, which are available to view online. Overall, I'm very impressed with the dashboard: it looks great, makes it easy to explore datasets, and is highly portable. I can certainly see myself using GRIMER on some of my future datasets, and I have no doubt that it can be a valuable tool for others in the field. I do however think that the documentation and usability of the tool can be improved, and I give some suggestions below. Addressing these issues will, in my opinion, lead to a wider adoption of the tool by researchers in the field.

Usability: I managed to test GRIMER on a 16S amplicon dataset, but given the sparsity of the documentation, this took me a little longer than expected (in addition to quite a few steps), and I think that there are improvements that could be made to make it easier for people to use GRIMER from formats that people commonly generate.
---Answer ---An online documentation page was created for running GRIMER, as well as a user manual for GRIMER reports with a walk-through of each tab and plot. They can be found at: https://pirovc.github.io/grimer/
--------------For example, QIIME2 is perhaps the most used 16S amplicon analysis pipeline, so the ability to import directly from .qza files (e.g. table.qza, taxonomy.qza) would give GRIMER much greater reach. If this is beyond the scope to incorporate within the GRIMER codebase, at least provide the exact code needed in the documentation for people to export their .qza files to files compatible with GRIMER. Likewise from phyloseq, a commonly used R package for microbiome analyses. Could some documentation/code be added about how best to export phyloseq objects to a format that GRIMER can handle?
---Answer ---Thanks for the suggestion. We included in the manual a guide on how to run GRIMER from commonly used tools, with QIIME2 and phyloseq included.
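As a generic illustration of the kind of conversion discussed here (it is not the procedure from the GRIMER manual), the sketch below collapses a feature table by taxonomy string and writes a plain tab-separated contingency table. The in-memory data, the function names, the `observation` header and the rows-as-observations orientation are all assumptions; the manual should be consulted for the layout GRIMER actually expects.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-ins for an already exported feature table and taxonomy.
counts = {
    "asv1": {"S1": 10, "S2": 0, "NTC": 3},
    "asv2": {"S1": 5, "S2": 7, "NTC": 0},
    "asv3": {"S1": 0, "S2": 2, "NTC": 9},
}
taxonomy = {
    "asv1": "k__Bacteria; p__Proteobacteria; g__Ralstonia",
    "asv2": "k__Bacteria; p__Proteobacteria; g__Ralstonia",
    "asv3": "k__Bacteria; p__Firmicutes; g__Bacillus",
}

def collapse_by_lineage(counts, taxonomy):
    """Sum the counts of features that share the same taxonomy string."""
    collapsed = defaultdict(lambda: defaultdict(int))
    for feature, per_sample in counts.items():
        lineage = taxonomy.get(feature, "unassigned")
        for sample, value in per_sample.items():
            collapsed[lineage][sample] += value
    return collapsed

def to_tsv(table, samples):
    """Serialize observations (rows) by samples (columns) as tab-separated text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["observation"] + samples)
    for observation in sorted(table):
        writer.writerow([observation] + [table[observation].get(s, 0) for s in samples])
    return buffer.getvalue()

samples = ["S1", "S2", "NTC"]
collapsed = collapse_by_lineage(counts, taxonomy)
tsv_text = to_tsv(collapsed, samples)
```

Writing `tsv_text` to a file gives a table that any TSV-based workflow can read. Features without a taxonomy entry are kept under an `unassigned` label rather than dropped.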
--------------I mostly analyse shotgun metagenomic datasets (genome-resolved), and I foresee more users using these types of data in the future. Therefore, the ability to parse gtdbtk outputs directly would be very helpful. Perhaps have a flag --gtdb that parses the 'gtdbtk.bac120.summary.tsv' and 'gtdbtk.ar53.summary.tsv' files. Following on from this, CoverM (https://github.com/wwood/CoverM) is quite commonly used for generating final MAG count tables (.tsv), so the ability to import them directly would be a really nice quality-of-life addition, and something that would not require much coding to accomplish. I believe that these adjustments will make the tool far more accessible for everyday users and increase the adoption of GRIMER by the wider community.
---Answer ---Thanks for the suggestion. We included in the documentation several ways to run GRIMER via .biom or .tsv files. Further, we made all base files from the analysis of the manuscript available and provided the code used to generate them (https://pirovc.github.io/grimer/examples/). We believe that those methods are general enough and will cover the usage for users coming from GTDB-Tk and CoverM.
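To make the route from genome-resolved outputs to a .tsv table concrete, here is a minimal sketch that joins a GTDB-Tk style summary with a CoverM style count table. `user_genome` and `classification` are columns of the GTDB-Tk summary files; the CoverM header (`Genome`, `<sample> Read Count`), the shortened placeholder lineages and all function names are assumptions made for this illustration, not part of GRIMER.

```python
import csv
import io

# Hypothetical excerpts with shortened placeholder lineages.
gtdbtk_summary = (
    "user_genome\tclassification\n"
    "mag_1\td__Bacteria;p__ExamplePhylum;g__ExampleGenusA\n"
    "mag_2\td__Bacteria;p__ExamplePhylum;g__ExampleGenusA\n"
)
coverm_counts = (
    "Genome\tS1 Read Count\tNTC Read Count\n"
    "mag_1\t120\t4\n"
    "mag_2\t30\t0\n"
    "mag_3\t8\t1\n"
)

def read_tsv(text):
    """Parse tab-separated text into a list of dictionaries keyed by header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

def join_counts_with_lineage(summary_text, counts_text):
    """Label each genome's counts with its lineage, summing shared lineages."""
    lineage = {row["user_genome"]: row["classification"] for row in read_tsv(summary_text)}
    table = {}
    for row in read_tsv(counts_text):
        genome = row["Genome"]
        # Genomes without a classification keep their own identifier.
        label = lineage.get(genome, genome)
        target = table.setdefault(label, {})
        for column, value in row.items():
            if column != "Genome":
                target[column] = target.get(column, 0) + int(value)
    return table

table = join_counts_with_lineage(gtdbtk_summary, coverm_counts)
```

The resulting dictionary maps each lineage (or unclassified genome identifier) to per-sample counts and can be written out as a .tsv file for the workflows described in the documentation.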
--------------For the actual report, if possible, I would like the ability to export ASVs/features/MAGs from the report that the user thinks are contaminants. This could be a list that the user could copy/paste, or the direct export of a .txt/.tsv. Perhaps the user could tick a box next to the ASVs/features/MAGs to save them to a list/viewer? The reason for this is that the logical next step I see after using GRIMER is to go back to your dataset and filter out the putative contaminant ASVs/features/MAGs. Being able to produce such a list will make subsequent filtering by the user easier.
---Answer ---Thanks for the valuable request. An export button was added to the Overview and Samples panels, where the user can easily export selected or all items of the tables for further usage.
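The follow-up filtering step the reviewer describes could look like the sketch below. It assumes the exported selection has been saved as plain text with one observation identifier per line, which is an assumption about the export rather than its documented format; the table contents and function names are hypothetical.

```python
# Hypothetical count table: observation identifier -> per-sample counts.
table = {
    "Ralstonia": {"S1": 15, "S2": 7, "NTC": 30},
    "Bacillus": {"S1": 0, "S2": 2, "NTC": 9},
    "Bacteroides": {"S1": 220, "S2": 180, "NTC": 0},
}

# Hypothetical exported selection, one identifier per line.
exported_text = "Ralstonia\n\nBacillus\n"

def parse_exported_ids(text):
    """Return the non-empty, whitespace-stripped lines of an exported list."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def split_table(table, flagged_ids):
    """Split the table into kept observations and flagged (removed) ones."""
    flagged = set(flagged_ids)
    kept = {obs: row for obs, row in table.items() if obs not in flagged}
    removed = {obs: row for obs, row in table.items() if obs in flagged}
    return kept, removed

flagged_ids = parse_exported_ids(exported_text)
kept, removed = split_table(table, flagged_ids)
```

Returning the removed rows alongside the kept ones keeps a record of what was filtered, so the decision can be reviewed later.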
--------------I couldn't get decontam to work with my dataset, here was the error: raise KeyError(f"None of [{key}]
---Answer ---It would be nice to have it reported in the repository so I can detect and fix the possible error. Instead of re-implementing it, GRIMER runs DECONTAM directly in R to keep compatibility and avoid any implementation differences. This is important due to the different ways numeric values are treated in R and Python. However, this bridge between languages can be tricky and may cause some issues. A bug report would be very helpful to improve the conversion code for general use cases.
Regarding the specification of negative and positive controls in the config.yaml, would it be possible for this to be implemented from the executable? For example, there could be a flag '--control-column' that specifies the column in the user's metadata file. '--control-column control' would parse the 'control' metadata column and, where the values are 'negative' or 'positive', assign the samples automatically. This is just a suggestion that could make it a bit easier for users to set control samples, rather than having to create a new .txt file and change the config.yml.
---Answer ---Thanks for the suggestion. It is now possible to specify which samples are controls by defining a metadata field and value(s) in the config.yml file, in addition to a file containing the sample identifiers. Detailed information on how to set up the controls is now in the manual: https://pirovc.github.io/grimer/config/
--------------Dependencies: When installing via conda, I ran into the following error: ImportError: cannot import name 'PearsonRConstantInputWarning' from 'scipy.stats' It seems that this can't be imported from later versions of scipy, but I managed to fix it by forcing scipy=1.8.1. You should be able to force this version in the conda recipe.
---Answer ---I tried different versions of scipy (1.9.3 and 1.10.0) and could not replicate this error using all GRIMER features. It would be nice to have it reported in the repository with your data so I can detect and fix the possible error.

analysis: basic data summaries, diversity and functional analysis, microbial interactions, differential abundance among others. Additionally, interactive tools for analytical and visual exploration are extremely helpful in this stage to better understand the data distribution and its properties and to guide further investigations to follow. In the last decade, several applications were developed with focus on visualization of microbiome data (Table 2

an ideal scenario for exogenous contaminants to out-compete and dominate the biological signal. It is important that contamination is acknowledged, accounted for and discovered at the earliest stage of a study prior to statistical analysis, to not bias measurements and to ensure that bias is not propagated into

interactive plots to better explore the data and to facilitate contamination detection. GRIMER integrates several sources, references, analyses as well as external tools and brings them together in one concise dashboard.

The output of GRIMER is a self-contained HTML file that can be visualized in any modern web-browser. It works independently from any actively running server or web-service. Once generated, it can be used and

to link findings to common contaminants or connect analyses outcomes with known environments or biomes.

Those entries can be easily provided by the user in a simple list of names or taxonomic identifiers in a formatted and annotated file (more information can be found in the GRIMER repository).

Contamination references
We compiled an extensive list of possible contaminant taxa reported in several studies (

Table 3 at genus and species level. If multiple child nodes of organisms are reported in the same study, they are counted here just once.
Additionally, we compiled another list of common organisms found in probable external contamination

Input data
• Taxonomy: GRIMER will automatically parse a given taxonomic annotation or generate one based on the provided observations. Data will be summarized in many taxonomic levels and plots will be created accordingly. Taxonomy is fully automated for several commonly used taxonomies (NCBI, GTDB, SILVA, GreenGenes, OTT).

• Controls: one or more groups of control samples can be provided in a simple text file. Those samples will be further used to summarize data and annotate plots.

• References: custom sources of contamination or any references can be provided in addition to the pre-compiled ones described above.

GRIMER will parse and process the data provided and run a set of analyses:

Further libraries were used to analyze samples and generate the report: pandas [74] for general parsing and data structures, scipy [75] for hierarchical clustering, scikit-bio (http://scikit-bio.org/) for transformations.

GRIMER will automatically parse given taxonomies or download and convert any taxonomic id or name inter-

We re-analyzed the samples in a standard pipeline with QIIME2 [6] for amplicon data and ganon for MGS data [79], generated a GRIMER report for both and searched for the previously detected contamination.

In the MGS report, the bar plot (Figure 1) shows a stark difference in signal between sample types but a

GRIMER is an easy-to-use and accessible tool for specialists and non-specialists that generates a concise interactive offline dashboard with a set of analyses, visualizations, and data connections from a simple table of counts. It automatically summarizes several levels of evidence to better understand the relation between observations, samples, metadata, and taxonomy. GRIMER reports are a valuable resource for investigating contamination, a problem that affects every microbiome study to some degree.

All the conclusions and visualizations presented in the results section of this work were based solely on GRIMER reports, showing that microbiome analysis, contamination investigation and detection are possible with the methodology proposed. The use of multiple sources of evidence to annotate observations improves the ability to detect clear contaminants in microbiome studies as well as to point to probable groups of candidate contaminants.

In addition to the GRIMER software, we compiled and provided in this work a list of common taxa contaminants based on 22 publications (

was developed in a way that new visualizations can be included with little effort. We listed and summarized similar currently available methods published in the last 10 years (Table 2) as well as web-platforms for complete analyses of microbiome data (

(Table 2). This may be impractical for many non-specialists and for long term storage and reproducibility.

GRIMER reports are portable and fully functional offline. This allows analysis to be accessible by many researchers with different backgrounds working together in the same study, increasing direct interaction with data. The portability also enables better documentation of results, reproducibility and shareability. Further, web-based tools may disappear after some years of inactivity or lack of funding and analysis may be lost, as is the case for some methods (Table 5). GRIMER reports are completely offline and will work as long as the report file is safely stored.

Overall we believe that GRIMER is a valuable contribution to the microbiome field and can facilitate data exploration, analysis and contamination detection.

The datasets and metadata for the placenta study were obtained from:

Figure 3: Heatmap visualization at species level for the KatharoSeq data. Samples are grouped by study type (y-axis) and clustered by observations (x-axis, euclidean distance metric, complete method). Data in the heatmap is centered log-ratio transformed. The bottom panel shows annotations related to the observations. "Contaminants" and "Human-related" annotations are normalized counts against the pre-compiled lists of references described in this paper. "decontam" is the normalized DECONTAM p-score. All "control" annotations show the proportion of the observation in the indicated group of control samples.
Figure 4: Heatmap visualization at genus level for the KatharoSeq data. Sample and observation axes are clustered and sorted based on the euclidean distance metric, complete method. Data in the heatmap is centered log-ratio transformed. The bottom panel shows annotations related to the observations. "Contaminants" and "Human-related" annotations are normalized counts against the pre-compiled lists of references described in this paper. "decontam" is the normalized DECONTAM p-score. All "control" annotations show the proportion of the observation in the indicated group of control samples. The metadata panel shows color-coded sample information on study (md title) and type of sample (md control verbose). The annotation panel shows higher values on multiple sources of evidence for contamination relative to data clusters of the heatmap. The metadata panel shows how samples show independent patterns based on the environment (md title) and difference from controls (md control verbose).
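The transformation and clustering named in the heatmap legends can be illustrated with a small, dependency-free sketch: a centered log-ratio transform followed by agglomerative clustering with euclidean distance and complete linkage. The toy counts, the pseudocount of 1 used to avoid log(0) and the function names are assumptions for illustration only; according to the text above, the reports themselves rely on scipy and scikit-bio.

```python
import math
from itertools import combinations

# Hypothetical toy counts: four samples by three observations.
samples = [
    [100, 10, 1],
    [90, 12, 2],
    [1, 10, 100],
    [2, 9, 95],
]

def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform; the pseudocount is an assumed zero fix."""
    logs = [math.log(count + pseudocount) for count in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_linkage(vectors):
    """Agglomerative clustering; returns merges as (cluster_a, cluster_b, distance)."""
    clusters = [(index,) for index in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Complete linkage: cluster distance is the largest pairwise distance.
            distance = max(euclidean(vectors[i], vectors[j]) for i in a for j in b)
            if best is None or distance < best[0]:
                best = (distance, a, b)
        distance, a, b = best
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, distance))
    return merges

transformed = [clr(sample) for sample in samples]
merges = complete_linkage(transformed)
```

In this toy case the two samples dominated by the first observation merge first, then the two dominated by the third, and the two groups join last, which is the block structure a clustered heatmap makes visible.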